IN THE CLAIMS 

Claims 1, 6, 10, 12, 13, 14, 24, 28 and 36 are amended herein. Claims 2, 7, 
15, and 26 are cancelled. All pending claims are produced below. 



1. (Currently Amended) A system for finding compounds in a text
corpus, comprising:

a vocabulary comprising tokens extracted from a text corpus; and
a compound finder configured to iteratively identify compounds
having a plurality of lengths within the text corpus, each compound
comprising a plurality of tokens, comprising:
    an iterator configured to select n-grams having a same length
    that is less than a length of n-grams selected during a
    previous iteration;
    an n-gram counter configured to evaluate a frequency of
    occurrence for one or more n-grams having the same length in
    the text corpus, each n-gram comprising at least one token
    selected from the vocabulary; and
    a likelihood evaluator configured to determine a likelihood of
    collocation for one or more of the n-grams having the same
    length, adding a subset of n-grams having a highest likelihood
    as compounds to the vocabulary and rebuilding the vocabulary
    based on the added compounds.

2. (Cancelled) 



3. (Currently Amended) A system according to Claim 1, wherein
only some of the subset of n-grams having a highest likelihood are added as
compounds to the vocabulary.



4. (Original) A system according to Claim 1, wherein the likelihood
of collocation as a likelihood ratio $\lambda$ is computed in accordance with the formula:



where $L(H_i)$ is a likelihood of observing $H_i$ under an independence hypothesis,
$L(H_c)$ is a likelihood of observing $H_c$ under a collocation hypothesis, and $H$ is a
pair of tokens.

5. (Original) A system according to Claim 4, wherein the $L(H_c)$ is
determined, comprising dividing the n-gram into $n-1$ pairings of segments,
calculating a likelihood of collocation for each pairing of segments, and selecting
the maximum likelihood of collocation of the pairings as $L(H_c)$.

6. (Currently Amended) A method for finding compounds in a text
corpus, comprising:

building a vocabulary comprising tokens extracted from a text corpus;
and
iteratively identifying compounds having a plurality of lengths within
the text corpus, each compound comprising a plurality of tokens,
comprising:
    selecting n-grams having a same length that is less than a
    length of n-grams selected during a previous iteration;
    evaluating a frequency of occurrence for one or more n-grams
    having the same length in the text corpus, each n-gram
    comprising at least one token selected from the vocabulary;
    determining a likelihood of collocation for one or more of the
    n-grams having the same length; and
    adding a subset of n-grams having a highest likelihood as
    compounds to the vocabulary and rebuilding the vocabulary
    based on the added compounds.
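
For illustration only, the steps recited in Claim 6 can be sketched in Python as follows. The function names, the parameters `max_len`, `top_k` and `min_count`, the underscore-joined compound tokens, and the PMI-style score are all assumptions made for this sketch. The claim does not prescribe any of them, and the score merely stands in for whatever likelihood of collocation is actually used.

```python
from collections import Counter
from math import log


def ngrams(tokens, n):
    """Return every contiguous n-token window of a token list as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def merge_compounds(tokens, compounds, n):
    """Rebuild the token stream: scanning left to right, each n-token window
    that is a chosen compound becomes one underscore-joined token."""
    out, i = [], 0
    while i < len(tokens):
        window = tuple(tokens[i:i + n])
        if len(window) == n and window in compounds:
            out.append("_".join(window))
            i += n
        else:
            out.append(tokens[i])
            i += 1
    return out


def placeholder_score(gram, gram_counts, unigram_counts, total):
    """Hypothetical stand-in score (pointwise-mutual-information style).
    Assumption: the claim does not fix a particular likelihood measure."""
    p_gram = gram_counts[gram] / total
    p_independent = 1.0
    for tok in gram:
        p_independent *= unigram_counts[tok] / total
    return log(p_gram / p_independent)


def find_compounds(tokens, max_len=3, top_k=1, min_count=2):
    """Iterate n from max_len down to 2. At each length: count n-grams,
    score them, add the top_k as compounds, and rebuild the vocabulary."""
    vocabulary = set(tokens)
    for n in range(max_len, 1, -1):  # each iteration uses a shorter length
        gram_counts = Counter(ngrams(tokens, n))
        unigram_counts = Counter(tokens)
        total = len(tokens)
        candidates = [g for g, c in gram_counts.items() if c >= min_count]
        candidates.sort(
            key=lambda g: placeholder_score(g, gram_counts, unigram_counts, total),
            reverse=True,
        )
        chosen = set(candidates[:top_k])
        vocabulary |= {"_".join(g) for g in chosen}
        tokens = merge_compounds(tokens, chosen, n)
    return vocabulary, tokens


corpus = ("new york city is large . visit new york city often . "
          "the town is big").split()
vocabulary, rebuilt = find_compounds(corpus, max_len=3, top_k=1)
```

In this toy corpus the only trigram that repeats is "new york city", so it is the one candidate at length 3 and is merged into a single token before the length-2 pass runs.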

7. (Cancelled) 

8. (Original) A method according to Claim 6, further comprising:
adding only some of the subset of the n-grams having a highest
likelihood as compounds to the vocabulary.

9. (Original) A method according to Claim 6, further comprising:
computing the likelihood of collocation as a likelihood ratio $\lambda$ in accordance with
the formula:

where $L(H_i)$ is a likelihood of observing $H_i$ under an independence hypothesis,
$L(H_c)$ is a likelihood of observing $H_c$ under a collocation hypothesis, and $H$ is a
pair of tokens.

10. (Currently Amended) A method according to Claim 9, further
comprising determining $L(H_c)$, comprising:

dividing the n-gram into $n-1$ pairings of segments;
calculating a likelihood of collocation for each pairing of segments;
and
selecting the maximum likelihood of collocation of the pairings as
$L(H_c)$.
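
As a hedged illustration of the three steps in Claim 10, the sketch below cuts an n-gram at each of its n-1 internal boundaries, scores each (left, right) pairing, and keeps the maximum. The names `segment_pairings` and `max_pairing_likelihood` are hypothetical, and the pairing score is supplied by the caller because the claim does not say how it is computed; the numbers in the usage lines are made up.

```python
def segment_pairings(gram):
    """Return the n-1 ways to cut an n-gram into a (left, right) pair."""
    return [(gram[:i], gram[i:]) for i in range(1, len(gram))]


def max_pairing_likelihood(gram, pair_likelihood):
    """Score every pairing with the caller-supplied function (an assumption
    of this sketch) and return the largest score."""
    return max(pair_likelihood(left, right)
               for left, right in segment_pairings(gram))


# Toy scores for the two pairings of a 3-gram (illustrative values only).
toy_scores = {
    (("new",), ("york", "city")): 0.4,
    (("new", "york"), ("city",)): 0.9,
}
best = max_pairing_likelihood(
    ("new", "york", "city"),
    lambda left, right: toy_scores[(left, right)],
)
```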

11. (Original) A computer-readable storage medium holding code for
performing the method according to Claim 6.

12. (Currently Amended) An apparatus for finding compounds in a
text corpus, comprising:

means for building a vocabulary comprising tokens extracted from a
text corpus; and
means for iteratively identifying compounds having a plurality of
lengths within the text corpus, each compound comprising a
plurality of tokens, comprising:
    means for selecting n-grams having a same length that is less
    than a length of n-grams selected during a previous iteration;
    means for evaluating a frequency of occurrence for one or more
    n-grams having the same length in the text corpus, each n-gram
    comprising at least one token selected from the vocabulary;
    means for determining a likelihood of collocation for one or
    more of the n-grams having the same length; and
    means for adding a subset of n-grams having a highest
    likelihood as compounds to the vocabulary and means for
    rebuilding the vocabulary based on the added compounds.

13. (Currently Amended) A system for identifying compounds
through iterative analysis of measure of association, comprising:

an iterator initially specifying a limit on a number of tokens per
compound for an iteration and decreasing the limit for a
subsequent iteration; and
a compound finder configured to iteratively evaluate compounds
within a text corpus, comprising:
    an n-gram counter configured to determine a number of
    occurrences of one or more n-grams within the text corpus,
    each n-gram comprising a number of tokens up to the limit for
    the iteration, which are at least in part provided in a
    vocabulary for the text corpus;
    a likelihood evaluator configured to identify at least one
    n-gram comprising a number of tokens equal to the limit for
    the iteration based on the number of occurrences and
    determining a measure of association between the tokens in
    the identified n-gram, and adding each identified n-gram with
    a sufficient measure of association to the vocabulary as a
    compound token; and rebuilding the vocabulary based on the
    added compound tokens.


14. (Currently Amended) A system according to Claim 13, further
comprising:

a stored upper limit on a number of identified n-grams; and
a limiter identifying a number of n-grams up to the stored upper limit
based on the number of occurrences.

15. (Cancelled) 

16. (Original) A system according to Claim 13, wherein the measure
of association between the tokens in the identified n-gram comprises a likelihood
ratio $\lambda$.

17. (Original) A system according to Claim 16, wherein the likelihood
ratio $\lambda$ is calculated in accordance with the formula:



where $L(H_i)$ is a likelihood of observing $H_i$ under an independence hypothesis,
$L(H_c)$ is a likelihood of observing $H_c$ under a collocation hypothesis, and $H$ is a
pair of tokens.

18. (Original) A system according to Claim 17, wherein, for each pair
of tokens, $t_1$, $t_2$, in the identified n-gram, the independence hypothesis comprises
$P(t_2 \mid t_1) = P(t_2 \mid \bar{t}_1)$ and the collocation hypothesis comprises $P(t_2 \mid t_1) > P(t_2 \mid \bar{t}_1)$.
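
One textbook way to compare an independence hypothesis with a collocation hypothesis for a token pair is a binomial log-likelihood ratio computed from corpus counts. The sketch below follows that standard formulation and is offered only as an assumption about how such a ratio might be computed; it is not taken from the claims. The names `log_binomial` and `log_likelihood_ratio`, the count arguments, and the clamping constant are hypothetical, and the textbook alternative hypothesis is two-sided, whereas the claim recites an inequality in one direction.

```python
from math import log


def log_binomial(k, n, p):
    """Log of p**k * (1 - p)**(n - k); the binomial coefficient is omitted
    because it cancels in the ratio. p is clamped away from 0 and 1."""
    eps = 1e-12
    p = min(max(p, eps), 1.0 - eps)
    return k * log(p) + (n - k) * log(1.0 - p)


def log_likelihood_ratio(c1, c2, c12, total):
    """Log of L(independence) / L(collocation) for the pair (t1, t2).

    c1, c2 : occurrences of t1 and of t2
    c12    : occurrences of t1 immediately followed by t2
    total  : number of tokens; this sketch assumes 0 < c1 < total
    """
    p = c2 / total                   # P(t2 | t1) = P(t2 | not t1) under independence
    p1 = c12 / c1                    # P(t2 | t1) under collocation
    p2 = (c2 - c12) / (total - c1)   # P(t2 | not t1) under collocation
    independent = (log_binomial(c12, c1, p)
                   + log_binomial(c2 - c12, total - c1, p))
    collocated = (log_binomial(c12, c1, p1)
                  + log_binomial(c2 - c12, total - c1, p2))
    return independent - collocated


strong = log_likelihood_ratio(c1=10, c2=10, c12=10, total=1000)
neutral = log_likelihood_ratio(c1=100, c2=100, c12=10, total=1000)
```

With these illustrative counts, the pair that always occurs together yields a strongly negative value, while the pair whose co-occurrence matches chance yields a value of zero.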

19. (Original) A system according to Claim 17, wherein the $L(H_i)$ is
computed for each pair of tokens, $t_1$, $t_2$, in the identified n-gram in accordance
with the formula:

$$\arg\max_{L(H_i)} \frac{L(t_1, t_2 \text{ form compound})}{L(n\text{-gram does not form compound})}$$

20. (Original) A system according to Claim 13, further comprising: 
an initial vocabulary comprising a plurality of tokens extracted from 

the text corpus. 

21. (Original) A system according to Claim 20, further comprising:
a parser parsing the tokens from the text corpus.

22. (Original) A system according to Claim 13, further comprising: 

a filter determining the number of occurrences of one or more n-grams
within the text corpus for only unique n-grams.

23. (Original) A system according to Claim 13, wherein each text 
corpus comprises a plurality of documents comprising one of a Web page, a news 
message and text. 

24. (Currently Amended) A method for identifying compounds
through iterative analysis of measure of association, comprising:

iteratively specifying a limit on a number of tokens per compound for
an iteration and decreasing the limit for a subsequent iteration; and
iteratively evaluating compounds within a text corpus, comprising:
    determining a number of occurrences of one or more n-grams
    within the text corpus, each n-gram comprising a number of
    tokens up to the limit for the iteration, which are at least
    in part provided in a vocabulary for the text corpus;
    identifying at least one n-gram comprising a number of tokens
    equal to the limit for the iteration based on the number of
    occurrences and determining a measure of association between
    the tokens in the identified n-gram; and
    adding each identified n-gram with a sufficient measure of
    association to the vocabulary as a compound token and
    rebuilding the vocabulary based on the added compound tokens.

25. (Original) A method according to Claim 24, further comprising:
providing an upper limit on a number of identified n-grams; and
identifying a number of n-grams up to the upper limit based on the
number of occurrences.
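
A minimal sketch of the upper limit in Claim 25, assuming that "based on the number of occurrences" means keeping the most frequent n-grams. That reading, the function name `most_frequent_ngrams`, and its parameters are assumptions of this example; it relies only on the standard-library `collections.Counter.most_common` method.

```python
from collections import Counter


def most_frequent_ngrams(tokens, n, upper_limit):
    """Count contiguous n-grams and return at most upper_limit of them,
    most frequent first, as (n-gram, count) pairs."""
    counts = Counter(tuple(tokens[i:i + n])
                     for i in range(len(tokens) - n + 1))
    return counts.most_common(upper_limit)


sample = "a b a b c a b".split()
top_one = most_frequent_ngrams(sample, n=2, upper_limit=1)
top_two = most_frequent_ngrams(sample, n=2, upper_limit=2)
```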

26. (Cancelled) 

27. (Original) A method according to Claim 24, wherein the measure
of association between the tokens in the identified n-gram comprises a likelihood
ratio $\lambda$.

28. (Currently Amended) A method according to Claim 27, further
comprising calculating the likelihood ratio $\lambda$ in accordance with the formula:



where $L(H_i)$ is a likelihood of observing $H_i$ under an independence hypothesis,
$L(H_c)$ is a likelihood of observing $H_c$ under a collocation hypothesis, and $H$ is a
pair of tokens.

29. (Original) A method according to Claim 28, wherein, for each pair
of tokens, $t_1$, $t_2$, in the identified n-gram, the independence hypothesis comprises
$P(t_2 \mid t_1) = P(t_2 \mid \bar{t}_1)$ and the collocation hypothesis comprises $P(t_2 \mid t_1) > P(t_2 \mid \bar{t}_1)$.

30. (Original) A method according to Claim 28, further comprising:
computing the $L(H_i)$ for each pair of tokens, $t_1$, $t_2$, in the identified
n-gram in accordance with the formula:

$$\arg\max_{L(H_i)} \frac{L(t_1, t_2 \text{ form compound})}{L(n\text{-gram does not form compound})}$$

31. (Original) A method according to Claim 24, further comprising:
constructing an initial vocabulary comprising a plurality of tokens
extracted from the text corpus.

32. (Original) A method according to Claim 31, further comprising:
parsing the tokens from the text corpus.

33. (Original) A method according to Claim 24, further comprising:
determining the number of occurrences of one or more n-grams within
the text corpus for only unique n-grams.

34. (Original) A method according to Claim 24, wherein each text 
corpus comprises a plurality of documents comprising one of a Web page, a news 
message and text. 



35. (Original) A computer-readable storage medium holding code for 
performing the method according to Claim 24. 



36. (Currently Amended) An apparatus for identifying compounds
through iterative analysis of measure of association, comprising:

means for specifying a limit on a number of tokens per compound for
an iteration and decreasing the limit for a subsequent iteration; and
means for iteratively evaluating compounds within a text corpus,
comprising:
    means for determining a number of occurrences of one or more
    n-grams within the text corpus, each n-gram comprising a
    number of tokens up to the limit for the iteration, which are
    at least in part provided in a vocabulary for the text corpus;
    means for identifying at least one n-gram comprising a number
    of tokens equal to the limit for the iteration based on the
    number of occurrences and means for determining a measure of
    association between the tokens in the identified n-gram; and
    means for adding each identified n-gram with a sufficient
    measure of association to the vocabulary as a compound token
    and means for rebuilding the vocabulary based on the added
    compound tokens.



